This notebook is designed to give you all the information you need to get started analyzing citation networks--and networks of just about anything else--using Python, NetworkX, NumPy, and Matplotlib.
We'll discuss the following points to help you get up and running quickly with the examples.
I'm Thom Neale, a Boston-based software developer for the nonprofit, nonpartisan Sunlight Foundation in Washington, DC. Most of the work I do is web development, scraper code, and some back-end work. I'm a recovering attorney, so I experiment with legal information in Python whenever possible. This talk grew out of research I did for the Canadian Legal Information Institute (CanLII) to help them decide how to expand the coverage of their databases of Canadian court decisions.
This notebook and the sample data it requires live on my github page at https://github.com/twneale/citation-network-analysis. To install locally, do the following:
$ virtualenv pydata2013
$ source ./pydata2013/bin/activate
$ git clone git@github.com:twneale/citation-network-analysis.git
$ cd citation-network-analysis
# Note: if the following step fails, you're missing dependencies. If resolving them is not your cup of tea,
# I recommend installing the free Anaconda distribution of Python from Continuum Analytics: https://store.continuum.io/
$ pip install -r requirements.txt
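Once that finishes, a quick optional sanity check is to confirm the core imports work before moving on:

$ python -c "import networkx, numpy, scipy, matplotlib"

If that command exits silently, you're good to go.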
Know this jargon, or keep this list handy to refer to, and you're ready to dive into network analysis:
In [83]:
import networkx as nx
# Here's how you create a simple directed graph in NetworkX.
G = nx.DiGraph()
# A node can be any hashable object.
G.add_node('twneale')
G.add_node('daveneale')
# Edges are simply tuples of nodes. You can associate arbitrary data with an edge.
G.add_edge('twneale', 'daveneale', relationship="family")
# NetworkX has a plethora of useful methods for building networks.
G.add_edges_from([
('twneale', 'jamesturk'),
('twneale', 'paultag'),
('paultag', 'jamesturk')
])
# NetworkX includes a "draw" function that plots the graph with matplotlib.
nx.draw(G)
We can also use NetworkX to identify cliques in the graph. But note that the clique problem is NP-complete, so if the clique detection algorithm in NetworkX isn't suitable for your use case, you'll find yourself in a dilly of a pickle.
In [84]:
undirected = G.to_undirected()
for clique in nx.find_cliques(undirected):
    print clique
# Note that it does a pretty good job on our small sample graph, though,
# even if clique-finding is NP-complete in general.
In [88]:
# Networks can be serialized to, and loaded from, JSON and a number of other formats.
# Here we're loading a citation graph of Supreme Court of Canada cases from scc.json.
from networkx.readwrite import json_graph
with open('scc.json') as f:
    G = json_graph.load(f)
from itertools import islice
# Now we can look at some basic characteristics of the network.
print 'Number of nodes:', G.number_of_nodes()
print 'Number of edges:', G.number_of_edges()
# Let's view some of the nodes.
print 'Here are a few nodes'
for node in islice(G.nodes_iter(), 5):
    print ' -', node
# Let's view some of the edges. Note that they're just 2-tuples of nodes.
print 'Here are a few edges'
for edge in islice(G.edges_iter(), 5):
    print ' -', edge
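Going the other direction is just as easy: convert the graph to a plain dict with json_graph.node_link_data and write it out with the standard library's json module. A minimal sketch (the output filename is just a placeholder, and this assumes the node-link flavor of the JSON format):

import json
from networkx.readwrite import json_graph
# Convert the graph to a node-link dict and serialize it with the stdlib json module.
data = json_graph.node_link_data(G)
with open('scc-copy.json', 'w') as f:
    json.dump(data, f)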
Now we have a larger network to explore. Good science is good observation, so let's make whatever observations we can about this example graph of case citations. As you saw above, simply printing a few nodes and edges is profoundly useless and tells us nothing about the data. Network data is challenging in that, even for small networks, there's really no easy way to understand your data without plotting its characteristics and looking at some charts. Someday, when you're having a panic attack trying to figure out how to make sense of gigabytes of graph data, remember: charts.
In [111]:
# Creating a histogram is a one-liner with matplotlib.
from matplotlib import pyplot as plt
values = [1, 2, 3, 2, 1, 2, 3, 4, 4, 3, 5, 3]
plt.hist(values, max(values))
plt.show()
In [118]:
# The histogram below shows the distribution of publication years of citing cases.
# So we retrieved the publication year of the citing case for each network edge,
# then plotted the distribution of those values.
from matplotlib import pyplot as plt
import numpy
years = tuple(G.get_edge_data(*edge)['year'] for edge in G.edges_iter())
plt.hist(years, 14)
plt.xlabel('Years')
plt.ylabel('Number of Cases')
plt.show()
Plotting the degree distribution of a network is an important step in getting familiar with your data. If the degree distribution follows a power law, the network is "scale-free", meaning a large number of edges are distributed among a small number of nodes, and a small number of edges are distributed among a large number of nodes. In short, a few nodes are hogging all the connections.
Why bother with this? If the network is scale-free, we may be able to apply network analysis techniques that are known to work well on other scale-free networks, like the Internet.
In [120]:
degrees = G.degree().values()
h, bins = numpy.histogram(degrees, bins=100)
x_series = bins.compress(h)
y_series = h.compress(h)
plt.plot(x_series, y_series)
plt.xlabel('Number of citations (degree)')
plt.ylabel('Number of cases')
plt.show()
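A quick way to eyeball whether the distribution really follows a power law is to redraw it on log-log axes, where a power law shows up as a roughly straight line. Here's a rough sketch reusing x_series and y_series from the cell above (the marker styling is my own choice, not part of the original analysis):

# Re-plot the binned degree distribution on log-log axes.
# A power-law (scale-free) distribution should look roughly linear here.
plt.loglog(x_series, y_series, marker='o', linestyle='none')
plt.xlabel('Number of citations (degree)')
plt.ylabel('Number of cases')
plt.show()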
Here's how to find the highest-ranked case using PageRank and print the cases that cite it:
In [123]:
from operator import itemgetter
# As you can see, calculating PageRank values for our graph isn't too tricky.
ranks = nx.pagerank(G)
# Let's retrieve the highest ranked case.
highest = sorted(ranks, key=ranks.get).pop()
first = itemgetter(0)
print 'The highest ranked case was %s (%r)' % (highest, ranks[highest])
print 'It was cited by these cases:'
print sorted(map(first, G.in_edges(highest)))
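If you want more than just the single top case, heapq.nlargest will pull the top few straight out of the ranks dict. A small aside, not part of the original analysis:

import heapq
# Print the five highest-PageRank cases along with their scores.
for case in heapq.nlargest(5, ranks, key=ranks.get):
    print ' -', case, ranks[case]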
Another important step in verifying that your graph has the characteristics you expect is to test your data for a correlation with an external data source. Below we do a basic linear regression between the number of pageviews each case received on CanLII's website and the number of inbound citations to each case (obtained via the G.in_degree() call):
In [157]:
import json
from scipy import stats
# Now we load page-view counts for the top 1000 or so Supreme Court of Canada cases.
with open('pageviews.json') as f:
    pageviews = json.load(f)
ranks = G.in_degree()
ids = set(pageviews) & set(ranks)
x = map(pageviews.get, ids)
y = map(ranks.get, ids)
res = stats.linregress(x, y)
slope, intercept, r_value, p_value, std_err = res
print ' -', 'r_value', r_value
print ' -', 'p_value', p_value
print ' -', 'standard error', std_err
We can see a weak correlation, which results from using a small subset of the full network as an example. When I ran this same regression using the in-degree centrality scores of the full network of 1.2 million citation edges versus the top 1000 most popular cases from each jurisdiction, the correlation was much stronger (over 0.40), as we might expect it to be. Another sign that nothing is horribly wrong with our graph! Keep those coming.
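Numbers alone can hide a lot, so it's also worth scattering the two series against each other and overlaying the fitted line. A minimal sketch reusing x, y, slope, and intercept from the regression cell above (the styling choices are mine):

# Scatter page views against inbound citations and overlay the fitted line.
plt.scatter(x, y, alpha=0.5)
fit_x = numpy.array(sorted(x))
plt.plot(fit_x, intercept + slope * fit_x, 'r-')
plt.xlabel('Page views on CanLII')
plt.ylabel('Inbound citations')
plt.show()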
In [161]:
from collections import defaultdict
# To see how each case's citation count evolves over time, compute a separate
# in-degree count for each year by pruning away edges from all other years.
def prune(G, year):
    # Remove every edge whose 'year' attribute is present and differs from `year`.
    get_edge_data = G.get_edge_data
    for edge in G.edges():
        _year = (get_edge_data(*edge) or {}).get('year')
        if _year and _year != year:
            G.remove_edge(*edge)

ranks = defaultdict(list)
for year in range(2000, 2014):
    # Work on a copy so the full graph survives for the next iteration.
    H = G.copy()
    prune(H, year)
    for _id, rank in H.in_degree().items():
        ranks[_id].append(rank)
In [165]:
# Now we have an in-degree count for each case for each year:
import pprint
years = range(2000, 2014)
pprint.pprint(zip(years, ranks[highest]))
In [178]:
from itertools import dropwhile
def polyfit(x, y, degree):
    # Work with arrays so the arithmetic below broadcasts correctly.
    x, y = numpy.asarray(x), numpy.asarray(y)
    results = {}
    coeffs = numpy.polyfit(x, y, degree)
    # Polynomial coefficients
    results['polynomial'] = coeffs.tolist()
    # r-squared
    p = numpy.poly1d(coeffs)
    # Fit values, and mean
    yhat = p(x)  # or [p(z) for z in x]
    ybar = numpy.sum(y) / len(y)  # or sum(y)/len(y)
    ssreg = numpy.sum((yhat - ybar)**2)  # or sum([(yihat - ybar)**2 for yihat in yhat])
    sstot = numpy.sum((y - ybar)**2)  # or sum([(yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot
    results['fit_fn'] = p
    return results
def plot(case, degree=1):
    vals = ranks[case]
    vals = zip(range(2000, 2014), vals)
    # Skip leading years in which the case had no citations yet.
    vals = list(dropwhile(lambda (year, deg): not deg, vals))
    x_series, y_series = zip(*vals)
    fig = plt.figure()
    # Plot the scatter.
    ax = fig.add_subplot(111)
    ax.plot(x_series, y_series, 'o')
    # Plot the regression line.
    result = polyfit(x_series, y_series, degree)
    ax.plot(x_series, result['fit_fn'](x_series),
            label='Degree %d, r2=%r' % (degree, result['determination']))
    #ax.fill_between(x_series, result['fit_fn'](x_series), facecolor='y', alpha=0.3)
    plt.title('Citation: %s' % case)
    ymax = max(y_series)
    plt.legend()
    plt.ylim(ymin=0, ymax=ymax + ymax * 0.1)
    plt.xlim(xmin=min(x_series), xmax=max(x_series))
    plt.show()
plot(highest, degree=1)